Day 1: Getting data into shape

Tidyverse world

Ben Fanson https://bfanson.github.io/2024DADAworkshop/ (Arthur Rylah Institute)https://www.ari.vic.gov.au/
2024-09-04

Day’s objectives

Key packages

Workshop project: Wetland Data

Workflows

Examples of workflows

mine

Paul’s

Commonalities

Advantages of workflow

“Laziness” in data analysis programming

A crucial attribute of a workflow is “laziness”. As several of you will know, I often say that I am “lazy” in my programming: I do not want to think about coding any more than I have to. I have a set workflow that I basically always follow (as shown above), as well as a set programming style. I want to spend my time thinking about science (and statistical models), not programming.

There are two distinct advantages to this:

Rstudio

You can find a variety of cheatsheets at https://posit.co/resources/cheatsheets/

Basics

Tips and tricks

Setting up Rproject

Rproject

Project directory/folder structure

Version Control: Git/Github [Advanced]

As part of my due diligence, I have to mention version control. In short, version control is software that keeps track of all your changes (so you can go back if something breaks in your code) and is very helpful for collaboration (e.g. allowing you to both code away and then merge those changes).

For R/Rstudio, Git/Github is the most popular. Git is the version control software that lives locally on your computer and does the tracking of all changes. In contrast, Github is an online platform to which you can upload those changes (and project folders/files); it is mainly useful for collaboration/sharing (plus some other useful features).

Key points on Git/Github:

GIT resources to get started

Cheatsheet: https://rstudio.github.io/cheatsheets/git-github.pdf

Useful book with R focus: https://happygitwithr.com/
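If you want to wire a project up from within R itself, a minimal sketch using the usethis package (assumed installed; `use_github()` additionally needs a GitHub token configured):

```r
# run once per project, from inside the Rproject
usethis::use_git()      # initialise a local Git repository and make the first commit
usethis::use_github()   # create a linked GitHub repository (needs a GitHub token)
usethis::git_sitrep()   # "situation report": checks your Git/GitHub configuration
```

These are interactive setup helpers, so run them from the console rather than a script.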

Programming good technique

Help!!!!!!!!

When learning and using R, you will get stuck and need help. There are a variety of help sources we use on a daily basis.

  1. Built into Rstudio are help files. These provide an initial starting place if you are interested in how to use a specific function: just put “?” before the function name and run it (e.g. ?mean), or use the Help tab

  2. Google search (e.g. stackoverflow, blogs) can be really helpful for finding new functions, packages, etc.

  3. Biometrics - happy to provide guidance/tips/thoughts

  4. ARI’s QEARI - hacky hour or Teams chat

  5. AI is becoming more and more useful [I am finding it the most useful]

Using AI for coding

A few AI I use regularly:

For this workshop, I will stick to Copilot given the accessibility for everyone.

For those not regularly using AI, there are a few things that can help when running your searches for help:

Tidyverse framework

Historical context

# example of getting a plot using base
  ds <- mtcars  # grab the built-in mtcars dataset
  ds <- ds[, c('mpg','hp')]   # select just the columns you want
  ds$l_100km <- 1/( ds$mpg/3.78 * 1.6 )*100 # convert mile per gallon to liters / 100km
  plot( ds$l_100km ~ ds$hp )  # plot relationship between horsepower and km/l

Basic principles

  # an example of tidyverse consistent structure
  class(mtcars)   # data.frame
  head(mtcars)    # notice the rowlabels
  View(mtcars)    # see the rownames in the spreadsheet
  as_tibble(mtcars,rownames = 'make')  # tidyverse datasets (tibbles) never have rownames

Tibbles

Pipes: %>% (magrittr package)

A key component of the tidyverse world is the concept of “pipes”. A pipe connects multiple steps into a single sequence of commands.

Building “sentences”

  library(tidyverse)
  head(mtcars)
  mtcars %>% head(10)  # get mtcars, pass to head function, take the first 10 rows
  
# sentence using tidy functions we will dive into below...just follow basic logic
  mtcars %>% 
    slice_head(n=10) %>%  # take first 10 rows   
    mutate( mpl = round( mpg/3.79,2 ) ) %>%  # convert mpg to miles per liter (mpl)
    select( mpg, mpl )  # keep just the mpg and mpl columns 

# placeholder "."
  mtcars %>% head(5)    
  mtcars %>% head(.,5)  # same as above but explicitly using . to denote the input

  # example where you want to input the dataset not into first argument spot
  mtcars %>% lm(mpg~hp) %>% summary()  # fails as it is trying lm(formula=mtcars, data=mpg~hp )
  mtcars %>% lm(mpg~hp, data=.) %>% summary()  # referencing the input by "."

“Odd” looking uses

With %>% it is worth being aware that you will see different variations that look odd. I show a few examples below.

  mtcars %>% head  # lack of brackets
  mtcars %>% head(n=3) %>%  bind_rows( ., .)   # multiple "." example
  mtcars %>% .$mpg   # grabs a specific column - combining tidyverse with base approaches 
  mtcars %>% {c( mean=mean(.$mpg), median=median(.$mpg) ) }  # curly bracket example using "." and $ - prevents putting . into c(., mean(.$mpg), median(.$mpg))

old vs. new: %>% vs. |>

|> (the native pipe, added in base R 4.1) is now considered the “simple” option compared to %>%

There are some differences, mainly associated with how you refer to the passed object (the placeholder)

Tip - set/learn the pipe shortcut: Ctrl-Shift-M (you can choose %>% or |> in Rstudio’s Global Options)

  library(tidyverse)
  head(mtcars)
  mtcars %>% head()
  mtcars |> head() # newer version 

# examples of differences (advanced)
  mtcars %>% head # works
  mtcars |> head  # fails  (must have brackets)
  mtcars |> head()  # works  (must have brackets)

  #subselecting
  mtcars %>% .$mpg
  mtcars |>  .$mpg  # $ not supported
  mtcars |>  pull(mpg)  # use pull

  mtcars %>% head(x=., 10) # . placeholder works for %>% 
  mtcars |>  head(x=., 10)  # fails - . placeholder does not work for |>
  mtcars |>  head(x=_, 10)  # use the _ placeholder instead
  mtcars |>  head(n=10, x=_)  # _ must be supplied to a named argument
  mtcars  %>%   head(n=10, .)  # %>% can place . without naming the argument
  mtcars  %>%   head(n=10, _)  # fails - _ does not work with %>%

Importing

Saving the data files

Excel files

Key functions

  readxl::read_xls()
  readxl::read_xlsx()

Tips and tricks

  readxl::read_xls()
  readxl::read_xlsx()

CSV/Table

# key functions
  readr::read_csv()
  readr::read_table()

# base R brings in a data.frame
  ds_base <- read.csv('data/raw/mtcars.csv')  
    head(ds_base)
    class(ds_base)  # data.frame

# readr brings in a tibble  
  ds_tidy <- read_csv('data/raw/mtcars.csv')  
    head(ds_tidy)
    class(ds_tidy)  # tbl_df (tibble)

Database (Advanced)

Some possible packages to check:
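To make the idea concrete, a minimal sketch using the DBI and RSQLite packages (assumed installed) with an in-memory database standing in for a real connection:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # swap for your real database connection
dbWriteTable(con, "mtcars", mtcars)               # stand-in for a table that already exists
ds <- dbGetQuery(con, "SELECT mpg, hp FROM mtcars WHERE hp > 100")  # query returns a data.frame
head(ds)
dbDisconnect(con)                                 # always close the connection
```

The same DBI verbs work across back-ends (Postgres, SQL Server, etc.); only the `dbConnect()` call changes.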

column names

  warning('create example excel file in')
  ds <- tibble::tibble(`First column`=1:5, `% column`= runif(5) )
  janitor::clean_names( ds )

empty rows

  ds <- tibble::tibble(`First column`=c(1:5,NA), `% column`= c(runif(5),NA ), final=rep(NA,6) )  # last row and final column are entirely NA
  janitor::remove_empty( ds )  # drops rows/columns that are all NA

  janitor::remove_empty( ds ) %>% 
      janitor::clean_names(  )

Advanced topic(s)

Importing lots of files with same (or similar-ish) structure

There are times when you will want to import 10s or 100s of files that are structurally the same (e.g. camera data, acoustic tagging data). I use the purrr package on an almost daily basis. This package was created to simplify the apply family of functions (e.g. apply, lapply, sapply, mapply) into a more intuitive nomenclature (like the rest of the tidyverse). One website that gives a tutorial is https://jennybc.github.io/purrr-tutorial/ (but there are several others out there).

  v_list <- dir('data/raw/tagging',full.names = T)   # get a list of files to import; full.names=T returns the relative paths
  importData <- function(x) read_csv( x ) %>% mutate(file_id=x) # create the function that imports a single file
  ds_all <- purrr::map_df( v_list, ~importData(.x) )   # loop through each filename in v_list, combine together into new dataset

Note - you could do this using for-loops but it takes more code.

Excel files and importing color metadata

First off, just don’t use cell styles (colour) in Excel as a way to store data - for instance, using colour to indicate treatment.

That being said, you can extract that information using the tidyxl package, which I have had good luck with.

Hands-on

After setting up the folder structure for your project, you need to add your data. Here, you will download the data/ directory using link below:

After the zip file downloads, unzip and drag the data/ folder to your Rproject folder.

XXXX data/raw, data/rds, data/spatial XXXX

Your Task

Import your data

Data cleaning

checking your data - each column

know the design so you can check for issues

notes on missing values

NA and NaN in R

other similar values: NaN, Inf

1/0

useful functions

is.na(), is.nan(), is.infinite()

NAs in other functions

  1. options: delete explicitly (missing at random), imputation techniques (mean)
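A quick base-R sketch of how these special values behave and how the checking functions differ:

```r
# base-R behaviour of missing/special values
v <- c(1, NA, NaN, 5)
is.na(v)               # TRUE for both NA and NaN
is.nan(v)              # TRUE only for NaN
is.infinite(1/0)       # TRUE - dividing by zero gives Inf, not an error
mean(v)                # missing values propagate by default
mean(v, na.rm = TRUE)  # 3 - na.rm drops NA/NaN before computing
```

Note that `is.na()` is deliberately broad (it flags NaN too), which is usually what you want when cleaning data.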

Joins

  knitr::include_graphics("dplyr.pdf")
Note: joins = merges (synonyms)

Project step


types of joins
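To make the join types concrete, here is a small sketch using hypothetical site and count tables (names invented for illustration):

```r
library(dplyr)

# hypothetical tables: site metadata and survey counts
sites  <- tibble(site = c('A','B','C'), wetland = c('N1','N2','N3'))
counts <- tibble(site = c('A','A','B','D'), n = c(3, 5, 2, 7))

left_join(counts, sites, by = 'site')   # keep all count rows; unmatched site D gets NA
inner_join(counts, sites, by = 'site')  # keep only rows that match in both tables
anti_join(counts, sites, by = 'site')   # count rows with no site match (handy for QC)
```

`anti_join()` is particularly useful for spotting typos in linking columns before you trust a merge.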

Tips and tricks

Your Task

Restructuring

  knitr::include_graphics("tidyr.pdf")
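The workhorses for restructuring are the tidyr pivot functions; a small sketch with a hypothetical wide dataset of yearly counts:

```r
library(tidyr)
library(dplyr)

# hypothetical wide dataset: one count column per survey year
ds_wide <- tibble(site = c('A','B'), yr2023 = c(3, 5), yr2024 = c(4, 6))

# wide -> long: one row per site x year
ds_long <- ds_wide %>% 
  pivot_longer(-site, names_to = 'year', values_to = 'count')
ds_long

# long -> wide: back to the original shape
ds_long %>% pivot_wider(names_from = year, values_from = count)
```

Long format is what most tidyverse modelling and plotting functions expect, so `pivot_longer()` tends to get far more use.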
Your Task

String manipulation

  knitr::include_graphics("stringr.pdf")
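A few common stringr clean-up moves, sketched on hypothetical messy site labels:

```r
library(stringr)

# hypothetical messy site labels
v <- c(' Site A ', 'site b', 'SITE C')

str_trim(v)                                       # drop leading/trailing whitespace
str_to_lower(v)                                   # standardise case
str_detect(v, fixed('site', ignore_case = TRUE))  # which elements mention "site"
str_replace(str_trim(v), ' ', '_')                # replace the first space with an underscore
```

Trimming whitespace and standardising case fixes a surprising share of “why don’t my joins match?” problems.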
Your Task

Date/time

  knitr::include_graphics("lubridate.pdf")
Key date concepts

date using lubridate

# tidy way to make a date
  as_date('2001-01-01') # assumes Year-month-day
  ymd('2001-01-01')  # better to be specific using helper function
  dmy('01-01-2021') # date the other way
  
# dates are stored as the number of days since 1970-01-01 in R 
  dt <- as_date('1970-01-01')
  class(dt)
  as.numeric(dt)  # days since 1970-01-01 is zero
  as_date('2024-01-01') %>% as.numeric() # 19,723 days since that reference date
  
# EXCEL WARNING: dates are stored as number of days since 1899-12-30 in Excel [leap year bug in Lotus]
  ds <- readxl::read_excel('data/raw/date_example.xlsx') %>% select( starts_with('dt_') )
  head(ds)  # notice the mix of numbers and text...[reminder: databases are so much better than excel]
            # notice the <S3: POSIXct> column type
  ( v_dt <- slice_head(ds, n=4 ) %>% pull(dt_excel) %>% as.numeric() )
  as_date(v_dt)  # obviously wrong because it is using 1970-01-01 as the reference date
  as_date(v_dt,origin='1899-12-30') 
  
  # switch from POSIXct to Date
  mutate(ds, dt_correct = as_date(dt_correct) ) # now it is <date>
  
  # for those interested, one possible fix to the column
  mutate(ds, dt_fixed = case_when( !is.na(as.numeric(dt_excel)) ~ as_date( as.numeric(dt_excel), origin='1899-12-30'),
                                   TRUE ~ dmy( dt_excel)  ) )
  # what does AI do? Try "using tidyverse in R, how can I fix a dataset that has a column that is text that needs to be converted to date but the column has julian days as well as date character formats in it?"


# lots of useful date helpers in lubridate  
  dt <- as_date('2024-09-16')
  year(dt)  # year
  month(dt) # month
  week(dt)  # week in the year
  wday(dt)  # day in the week (monday=2)
  yday(dt)  # day in the year
  leap_year(dt)  # is this year a leap year?
  
  
# tip using built in month vectors
    month.abb[ month(dt) ] # example to get the month abbreviation
    month.name[ month(dt) ] # example to get the month fullname

lubridate vs. base

Of course, there is also the base way: as.Date(). as_date() was created to fix a few pitfalls with as.Date(), so it is safer/better to just use as_date()

# tidy vs base...  
  as_date('01-11-2021')  # gives error - tidyverse
  as.Date('01-11-2021')  # note the issue - base R

# note assumptions still happen but tidy is a bit safer  
  as_date('01-01-01') # assumes Year-month-day and that the year is 20XX
  as.Date('01-01-01') # no chance of correct
  
  dmy('01-11-2021')  # can use the helper functions to convert
  as_date('01-11-2021', format='%d-%m-%Y') # or add a format for those inclined
  ?strptime  # one place to find format codes
  # copilot - try "what is the date format code for 01-nov-2022 using R"  
  
# timezone stuff-ups  
  dt_utc <- ymd_hms("2024-09-01 00:50:50")
  dt_europe <- ymd_hms("2024-09-01 00:50:50", tz = "Europe/London")
  c(as_date(dt_utc), as.Date(dt_utc))
  c(as_date(dt_europe), as.Date(dt_europe) )

  as.Date(dt_europe, tz= tz(dt_europe) ) # have to grab the tz explicitly

Key datetime concepts

It is very analogous to the date concepts, so let’s mirror our previous steps

# tidy way to make a datetime
  as_datetime('2001-01-01 10:00:00') 
  ymd_hms('2001-01-01 10:00:00')  # better to be specific using helper function
  dmy_hm('01-Nov-2001 10:00')  # better to be specific using helper function

# datetimes are stored as the number of seconds since 1970-01-01 00:00:00 UTC in R 
  dttm <- ymd_hms('1970-01-01 00:00:60')
  as.numeric(dttm)  # 60 seconds

# EXCEL WARNING: datetimes are stored as fractional days since 1899-12-30 in Excel [leap year bug in Lotus]
  ds <- readxl::read_excel('data/raw/date_example.xlsx') %>% select( contains('tm_') )
  head(ds)  # notice the mix of numbers and text...[reminder: databases are so much better than excel]
            # notice the <S3: POSIXct> column type
  ( v_tm <- slice_head(ds, n=4 ) %>% pull(tm_excel) %>% as.numeric() )  # tm_excel = the Excel-number column
  as_datetime(v_tm)  # obviously wrong because it treats the numbers as seconds since 1970-01-01
  as_datetime(v_tm*86400, origin='1899-12-30')  # convert fractional days to seconds first
  
  # switch the text/number column to a proper datetime
  mutate(ds, tm_correct = as_datetime(tm_correct) ) # now it is <dttm>
  
  # for those interested, one possible fix to a mixed-format column (mirrors the date example above)
  mutate(ds, tm_fixed = case_when( !is.na(as.numeric(tm_excel)) ~ as_datetime( as.numeric(tm_excel)*86400, origin='1899-12-30'),
                                   TRUE ~ dmy_hm( tm_excel )  ) )
  # what does AI do? Try "using tidyverse in R, how can I fix a dataset that has a text column mixing Excel serial datetimes and datetime character formats?"

Saving the clean data

At the end of the import and clean stage, I save the dataset as an RDS file.

# to save an RDS file
  saveRDS( list(), 
          'data/rds/ards_main.rds')  # ards = Analysis-ready dataset

# to read an RDS file
  readRDS('data/rds/ards_main.rds')

Revisiting day’s objectives

Advanced topics

Quality check packages

If you know that you will be re-importing updated files repeatedly, it can be worthwhile to do some upfront QC/QA checks to ensure that the data coming in are clean. There are some packages to help with that, though there is a bit of overhead in both learning the package and implementing it.

validate package

The validate package is intended to make checking your data easy, maintainable, and reproducible. A few of the attributes:
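A minimal sketch of the core validate workflow (`validator()` defines rules, `confront()` applies them), here checked against mtcars rather than the workshop data:

```r
library(validate)

# define re-usable rules, then confront the data with them
rules <- validator(
  mpg > 0,               # numeric range check
  cyl %in% c(4, 6, 8),   # allowed categories
  !is.na(hp)             # completeness check
)
out <- confront(mtcars, rules)
summary(out)   # pass/fail counts per rule
```

Keeping the rules in a script (or a separate rules file) means every re-import gets checked against the same criteria automatically.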

pointblank package

*According to the package, it can methodically validate your data whether in the form of data frames or as database tables. On top of the validation toolset, the package gives you the means to provide and keep up to date the information that defines your tables.*

I have used before but

Big Data

When data start to get big (millions of rows), the tidyverse can struggle with speed. This is a huge topic, but I will direct you to a couple of useful packages that let you keep writing dplyr code while a faster engine does the work.
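Two such back-ends are dtplyr (data.table underneath) and arrow; a sketch with dtplyr (assumed installed), demonstrated on mtcars:

```r
library(dtplyr)
library(dplyr)

ds_lazy <- lazy_dt(mtcars)        # wrap the data once; verbs now build a data.table query
ds_lazy %>% 
  filter(hp > 100) %>% 
  group_by(cyl) %>% 
  summarise(mean_mpg = mean(mpg)) %>% 
  as_tibble()                     # nothing runs until you collect the result back
```

The appeal is that the code reads exactly like ordinary dplyr, so nothing new to learn beyond the `lazy_dt()` wrapper and the final collect step.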

Hands-on component

Your Task: Set up an Rproject for the workshop study

Bring in the second data file

Resources

Obviously, we have only scratched the surface of the topics here. Here are a couple of books I have used…

For a good e-book going into more details on the topics here (plus lots else)… R for Data Science

For those wishing to up their game, see Advanced R